      1 \section{Unsupervised Extraction Models}
      2 \label{sec:relation extraction:unsupervised}
      3 \begin{epigraph}
      4 		{Yann LeCun}
      5 		{Inaugural Lecture at Collège de France}
      6 		{2016}
      7 	% This translation comes from a Facebook post made by Yann LeCun himself, but the Collège de France reference should be easier to find.
      8 	If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.
      9 \end{epigraph}
     10 In the unsupervised setting, no samples are labeled with a relation, i.e.~all samples are triplets (sentence, head entity, tail entity) from \(\dataSet\subseteq\sentenceSet\times\entitySet^2\).
     11 Furthermore, no information about the relation set \(\relationSet\) is available.
This is problematic since it is not well defined whether a specific semantic link is worthy of appearing in \(\relationSet\).
     13 Having so little information about what constitutes a relation makes the problem intractable if we do not impose some restrictions upon \(\relationSet\).
None of the unsupervised models presented in this section is universal: each makes some kind of assumption about the structure of the data or about its underlying knowledge base.
     15 However, developing unsupervised relation extraction models is still interesting for three reasons: they
     16 \begin{itemize*}
     17 	\item[(1)] do not necessitate labeled data except for validating the models;
     18 	\item[(2)] can uncover new relation types; and
     19 	\item[(3)] can be trained from large unlabeled datasets and then fine-tuned for specific relations.
     20 \end{itemize*}
     21 
For all models, we list the important modeling hypotheses, such as \hypothesis{1-adjacency} and \hypothesis{pullback} introduced previously.
     23 Appendix~\ref{chap:assumptions} contains a list of assumptions with some counterexamples and references to the sections where they were introduced.
We strongly encourage the reader to refer to it, especially when the implications of a modeling hypothesis are not immediately clear.
     25 
     26 \subsection{Evaluation}
     27 \label{sec:relation extraction:unsupervised evaluation}
The outputs of unsupervised models vary widely.
The main approaches fall into two categories:
     30 \begin{description}
     31 	\item[Clustering] A first approach is to cluster the samples such that all samples in the same cluster convey the same relation and samples in different clusters convey different relations.
     32 	\item[Similarity Space] A second approach is to associate each sample with an element of a vector space equipped with a similarity function.
     33 		If two samples are similar in this vector space, they convey similar relations.
     34 		This can be seen as a soft version of the clustering approach.
     35 \end{description}
     36 
     37 This distinction has an impact on how we evaluate the models.
     38 In the first case, standard clustering metrics are used.
     39 We introduce \bcubed{} \parencite{bcubed}, V-measure \parencite{v-measure} and \textsc{ari} \parencite{ari} in Section~\ref{sec:relation extraction:clustering}.
They are the most prevalent metrics in cluster evaluation; \bcubed{} in particular is widely used in unsupervised relation extraction.
     41 In the second case, a few-shot evaluation can be used \parencite{fewrel}.
     42 We introduce this approach in Section~\ref{sec:relation extraction:few-shot}.
     43 
     44 A difficulty of evaluating unlabeled clusters is that we do not know which cluster should be compared to which relation.
A possible solution to this problem is to use a small number of labeled samples to constrain the output of a model to fall into a specific relation set \(\relationSet\).
     46 This setup is actually similar to semi-supervised approaches such as label propagation (Section~\ref{sec:relation extraction:label propagation}), except that the model must be trained in an unsupervised fashion before being fine-tuned on the supervised dataset.
Similar to the label propagation model evaluation, unsupervised models evaluated by fine-tuning on a supervised dataset usually report performance while varying the number of training labels.
These performances are measured using the standard supervised metrics introduced in Section~\ref{sec:relation extraction:supervised evaluation}.
Evaluating performance as a pre-training method can be done for all unsupervised models, in particular similarity-space-based approaches.
     50 
     51 \subsubsection{Clustering Metrics}
     52 \label{sec:relation extraction:clustering}
     53 In this section, we describe three metrics used to evaluate clustering approaches.
The first metric, \bcubed{}, was first introduced to unsupervised relation extraction by rel-\textsc{lda} (\cite{rellda}, Section~\ref{sec:relation extraction:rellda}), while the other two were proposed as complements by \textcite{fitb}, presented in Chapter~\ref{chap:fitb}.
     55 
     56 \begin{epigraph}
     57 		{Valve}
     58 		{``Portal''}
     59 		{2007}
     60 		The cake is a lie.
     61 \end{epigraph}
     62 To clearly describe these different clustering metrics, we propose a common probabilistic formulation---in practice, these probabilities are estimated on the validation and test sets---and use the following notations.
     63 Let \(\rndm{X}\) and \(\rndm{Y}\) be random variables corresponding to samples in the dataset.
     64 Following Section~\ref{sec:relation extraction:supervised evaluation}, we denote by \(c(\rndm{X})\) the predicted cluster of \(\rndm{X}\) and \(g(\rndm{X})\) its conveyed gold relation.%
     65 \sidenote{
     66 	This implies that a labeled dataset is sadly necessary to evaluate an unsupervised clustering model.
     67 }
     68 
     69 \paragraph{B\texorpdfstring{\textsuperscript{3}}{³}}
     70 The metric most commonly computed for unsupervised model evaluation is a generalization of \fone{} for clustering tasks called \bcubed{} \parencitex{bcubed}.
     71 The \bcubed{} precision and recall are defined as follows:
     72 \begin{align*}
     73 	\bcubed \operatorname{precision}(g, c) & = \expectation_{\rndm{X},\rndm{Y}\sim\uniformDistribution(\dataSet_\relationSet)} P\left(g(\rndm{X})=g(\rndm{Y}) \mid c(\rndm{X})=c(\rndm{Y})\right) \\
	\bcubed \operatorname{recall}(g, c)    & = \expectation_{\rndm{X},\rndm{Y}\sim\uniformDistribution(\dataSet_\relationSet)} P\left(c(\rndm{X})=c(\rndm{Y}) \mid g(\rndm{X})=g(\rndm{Y})\right)
\end{align*}
     76 As precision and recall can be trivially maximized by putting each sample in its own cluster or by clustering all samples into a single class, the main metric \bcubed{} \fone{} is defined as the harmonic mean of precision and recall:
     77 \begin{equation*}
     78 	\bcubed \fone{}(g, c) = \frac{2}{\bcubed{} \operatorname{precision}(g, c)^{-1} + \bcubed{} \operatorname{recall}(g, c)^{-1}}
     79 \end{equation*}
     80 
     81 While the usual precision (Section~\ref{sec:relation extraction:supervised evaluation}) can be seen as the probability that a sample with a given prediction is correct, the \bcubed{} precision cannot use the correct relation as a reference to determine the correctness of a prediction.
Instead, whether an assignment is correct is computed as the expectation that a sample is accurately classified relative to all the other samples grouped in the same cluster.
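As an illustration, \bcubed{} can be computed directly from this per-sample definition. The sketch below (the function name and list-based interface are ours, purely illustrative) compares every sample to all samples sharing its cluster or its gold relation, itself included:

```python
def bcubed(gold, pred):
    """Per-item B-cubed precision, recall and F1.

    gold[i] is the gold relation of sample i, pred[i] its predicted cluster.
    Following the usual definition, a sample is also compared to itself.
    """
    n = len(gold)
    precision = recall = 0.0
    for i in range(n):
        # Samples grouped with i by the model, and samples sharing i's gold relation.
        cluster = [j for j in range(n) if pred[j] == pred[i]]
        klass = [j for j in range(n) if gold[j] == gold[i]]
        correct = sum(gold[j] == gold[i] for j in cluster)  # |cluster ∩ class|
        precision += correct / len(cluster)
        recall += correct / len(klass)
    precision /= n
    recall /= n
    f1 = 2 / (1 / precision + 1 / recall)
    return precision, recall, f1
```

As expected, grouping all samples into a single cluster yields a perfect recall but a degraded precision.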
     83 
     84 \paragraph{V-measure}
     85 Another metric is the entropy-based V-measure \parencitex{v-measure}.
     86 This metric is defined by homogeneity and completeness, which are akin to \bcubed{} precision and recall but rely on conditional entropy.
     87 For a cluster to be homogeneous, we want most of its elements to convey the same gold relation.
     88 In other words, the distribution of gold relations inside a cluster must have low entropy.
     89 This entropy is normalized by the unconditioned entropy of the gold relations to ensure that it does not depend on the size of the dataset:
\begin{equation*}
	\operatorname{homogeneity}(g, c) = 1 - \frac{\entropy\left(g(\rndm{X})\mid c(\rndm{X})\right)}{\entropy\left(g(\rndm{X})\right)}.
\end{equation*}
Similarly, for a cluster to be complete, we want all the elements conveying the same gold relation to be captured by this cluster.
In other words, the distribution of clusters inside a gold relation must have low entropy:
\begin{equation*}
	\operatorname{completeness}(g, c) = 1 - \frac{\entropy\left(c(\rndm{X})\mid g(\rndm{X})\right)}{\entropy\left(c(\rndm{X})\right)}.
\end{equation*}
As with \bcubed{}, the V-measure is summarized by the \fone{} value:
     99 \begin{equation*}
    100 	\operatorname{V-measure}(g, c) = \frac{2}{\operatorname{homogeneity}(g, c)^{-1} + \operatorname{completeness}(g, c)^{-1}}.
    101 \end{equation*}
    102 \begin{marginfigure}
    103 	\centering
    104 	\input{mainmatter/relation extraction/clustering metrics.tex}
    105 	\scaption[Comparison of \bcubed{} and V-measure.]{
    106 		Comparison of \bcubed{} and V-measure.
    107 		Samples conveying three different relations indicated by shape and color are clustered into three boxes.
The two rows represent two different clusterings; \bcubed{} favors the first one, while V-measure favors the second.
    109 		V-measure prefers the second clustering since the blue star cluster is kept pure; on the other hand, the green circle cluster is impure no matter what, so its purity is not taken as much into account by the V-measure compared to \bcubed{}.
    110 		\label{fig:relation extraction:clustering metrics}
    111 	}
    112 \end{marginfigure}
    113 Compared to \bcubed{}, the V-measure penalizes small impurities in a relatively ``pure'' cluster more harshly than in less pure ones.
    114 Symmetrically, it penalizes a degradation of a well-clustered relation more than of a less-well-clustered one.
    115 This difference is illustrated in Figure~\ref{fig:relation extraction:clustering metrics}.
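Following the textual definitions of homogeneity and completeness (a cluster is homogeneous when the gold relations inside it have low entropy, a relation is complete when the clusters covering it have low entropy), the V-measure can be sketched as follows; the helper names and list-based interface are ours and assume non-degenerate labelings (no zero entropies):

```python
from collections import Counter
from math import log

def entropy(labels):
    # Empirical Shannon entropy of a list of labels.
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    # H(labels | given): entropy of `labels` within each group of `given`,
    # weighted by the group sizes.
    n = len(given)
    h = 0.0
    for v in set(given):
        idx = [i for i, g in enumerate(given) if g == v]
        h += len(idx) / n * entropy([labels[i] for i in idx])
    return h

def v_measure(gold, pred):
    homogeneity = 1 - conditional_entropy(gold, pred) / entropy(gold)
    completeness = 1 - conditional_entropy(pred, gold) / entropy(pred)
    return 2 / (1 / homogeneity + 1 / completeness)  # harmonic mean
```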
    116 
    117 \paragraph{Adjusted Rand Index}
The Rand index (\textsc{ri}, \cite{ri}) is the last clustering metric we consider; it is defined as the probability that cluster and gold assignments are compatible:
    119 \begin{equation*}
    120 	\operatorname{\textsc{ri}}(g, c) = \expectation\limits_{\rndm{X},\rndm{Y}} \left[ P\left(
    121 			    c(\rndm{X})=c(\rndm{Y}) \Leftrightarrow g(\rndm{X})=g(\rndm{Y}) 
    122 	\right) \right]
    123 \end{equation*}
    124 In other words, given two samples, the \textsc{ri} is improved when both samples are in the same cluster and convey the same gold relation or when both samples are in different clusters and convey different relations; otherwise, the \textsc{ri} deteriorates.
    125 The adjusted Rand index (\textsc{ari}, \citex{ari}) is a normalization of the Rand index such that a random assignment has an \textsc{ari} of 0, and the maximum is 1:
    126 \begin{equation*}
    127 	\operatorname{\textsc{ari}}(g, c) =
    128 		\frac{\displaystyle\operatorname{\textsc{ri}}(g, c) - \expectation_{c\sim\uniformDistribution(\relationSet^\dataSet)}[\operatorname{\textsc{ri}}(g, c)]}
    129 		{\displaystyle\max_{c\in\relationSet^\dataSet} \operatorname{\textsc{ri}}(g, c) - \expectation_{c\sim\uniformDistribution(\relationSet^\dataSet)}[\operatorname{\textsc{ri}}(g, c)]}
    130 \end{equation*}
    131 In practice, the \textsc{ari} can be computed from the elements of the confusion matrix.
    132 Compared to the previous metrics, \textsc{ari} will be less sensitive to a discrepancy between precision--homogeneity and recall--completeness since it is not a harmonic mean of both.
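In practice, as stated above, the \textsc{ari} is computed from the contingency table rather than from its expectation-based definition. A minimal sketch (function names ours) using pair counts:

```python
from collections import Counter

def comb2(n):
    # Number of unordered pairs among n elements.
    return n * (n - 1) // 2

def adjusted_rand_index(gold, pred):
    n = len(gold)
    cells = Counter(zip(gold, pred))   # contingency table
    rows = Counter(gold)               # gold class sizes
    cols = Counter(pred)               # cluster sizes
    sum_cells = sum(comb2(c) for c in cells.values())
    sum_rows = sum(comb2(c) for c in rows.values())
    sum_cols = sum(comb2(c) for c in cols.values())
    # Expected index under random assignment, and maximum achievable index.
    expected = sum_rows * sum_cols / comb2(n)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```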
    133 
    134 \subsubsection{Few-shot}
    135 \label{sec:relation extraction:few-shot}
    136 \begin{marginparagraph}
This section only presents few-shot evaluation.
    138 	It is possible---and quite common---to train a model using a few-shot objective, usually as a fine-tuning phase before a few-shot evaluation.
    139 	Since we are mostly interested in unsupervised approaches, we do not delve into few-shot training.
    140 	See \textcite{fewrel} for details.
    141 \end{marginparagraph}
Clustering metrics are problematic since producing a clustering with no a priori knowledge of the relation schema \(\relationSet\) leads to unsolvable problems:
    143 \begin{itemize}
    144 	\item Should the relation \textsl{sibling} be cut into \textsl{brother} and \textsl{sister}?
    145 	\item Is the relation between a country and its capital the same as the one between a county and its seat?
    146 	\item Is the ear \textsl{part of} the head in the same fashion that the star Altair is \textsl{part of} the Aquila constellation?
    147 \end{itemize}
    148 All of these questions can be answered differently depending on the design of the underlying knowledge base.
    149 However, unsupervised clustering algorithms do not depend on \(\relationSet\).
    150 They must decide whether ``Phaedra is the sister of Ariadne'' and ``Castor is the brother of Pollux'' go inside the same cluster independently of these design choices.
    151 
    152 Fine-tuning on a supervised dataset solves this problem but adds another.
    153 The evaluation no longer assesses the proficiency of a model to learn from unlabeled data alone; it also evaluates its ability to adapt to labeled samples.
Furthermore, the smaller the labeled dataset is, the higher the variance of the results.
    155 On the other hand, the larger the labeled dataset is, the less the experiment evaluates the unsupervised phase.
    156 
A few-shot evaluation can be used to address these caveats.
    158 Instead of evaluating a clustering of the samples, few-shot experiments evaluate a similarity function between samples: \(\operatorname{sim}\colon \dataSet\times\dataSet\to\symbb{R}\).
    159 Given a query sample \(x^{(q)}\) and a set of candidates \(\vctr{x}^{(c)}=\{x_i^{(c)}\mid i=1,\dotsc,C\}\), %
    160 \begin{marginparagraph}
    161 	\(C\) is the number of candidates, in Table~\ref{tab:relation extraction:few-shot problem} we have \(C=5\).
    162 \end{marginparagraph}
    163 the model is evaluated on whether it is able to find the candidate conveying the same relation as the query.
    164 This is simply reported as an accuracy by comparing \(\argmax_{x\in\vctr{x}^{(c)}} \operatorname{sim}(x^{(q)}, x)\) with the correct candidate.
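This evaluation loop is straightforward to sketch. In the snippet below, the tuple-based interface and the toy dot-product similarity over feature vectors are our own illustrative assumptions, not part of any benchmark implementation:

```python
def few_shot_accuracy(problems, sim):
    """problems: iterable of (query, candidates, correct_index) tuples."""
    correct = total = 0
    for query, candidates, answer in problems:
        # Pick the candidate most similar to the query.
        prediction = max(range(len(candidates)),
                         key=lambda i: sim(query, candidates[i]))
        correct += prediction == answer
        total += 1
    return correct / total

def dot(x, y):
    # Toy similarity between feature vectors.
    return sum(a * b for a, b in zip(x, y))
```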
    165 
    166 \begin{table}[ht!]
    167 	\centering
    168 	\input{mainmatter/relation extraction/few-shot problem.tex}
    169 	\sidecaption[Few-shot problem.]{
    170 		Few-shot problem.
    171 		For ease of reading, the entity identifiers---such as \wdent{450036} for ``Hörsel''---are not given.
    172 		Both the query and the third candidate convey the relation \wdrel{206} \textsl{located in or next to body of water}.
    173 		\label{tab:relation extraction:few-shot problem}
    174 	}
    175 \end{table}
    176 
    177 Table~\ref{tab:relation extraction:few-shot problem} gives an example of a few-shot problem.
    178 It illustrates the five-way one-shot problem, meaning that we must choose a relation among five and that each of the five relations is represented by a single sample.
    179 Another popular variant is the ten-way five-shot problem: the candidates are split into ten bags of five samples each, all samples in a bag convey the same relation, and the goal is to predict the bag in which the query belongs.
    180 \begin{marginparagraph}
    181 	Quite confusingly, they can also be referred to as ``meta-train'' and ``meta-test.''
Indeed, to follow the usual semantics of the ``meta-'' prefix, the ``meta-sets'' should refer to sets of \((\text{query}, \text{candidates})\) tuples, not the candidates themselves.
    183 \end{marginparagraph}
    184 Candidates are sometimes referred to as ``train set'' and the query as ``test set'' since this can be seen as an extremely small dataset with five training samples and one test sample.
    185 
    186 FewRel, described in Section~\ref{sec:datasets:fewrel}, is the standard few-shot dataset.
    187 In FewRel, Altair is not \wdrel{361} \textsl{part of} Aquila, it is \wdrel{59} \textsl{part of constellation} Aquila.
    188 However, this design decision does not influence the evaluation.
    189 Given the query ``Altair is located in the Aquila constellation,'' a model ought to rank this sample as more similar to samples conveying \textsl{part of constellation} than to those conveying other kinds of \textsl{part of} relationships.
    190 If FewRel made the opposite design choice, the model would still be able to achieve high accuracy by ensuring \textsl{part of} samples are similar.
The decision whether to split the \textsl{part of} relation should be of no concern to the unsupervised model.
    192 
    193 \subsection{Open Information Extraction}
    194 \label{sec:relation extraction:oie}
In open information extraction (\textsc{oie}, \citex{oie}), the closed-domain assumption (Section~\ref{sec:relation extraction:domain restriction}) is made neither for relations nor for entities, which are extracted jointly.
Instead, \(\entitySet\) and \(\relationSet\) are implicitly defined from the language itself: typically, a fact \((e_1, r, e_2)\) is expressed as a triplet such as (noun phrase, verb phrase, noun phrase).
    197 This makes \textsc{oie} particularly interesting when processing large amounts of data from the web, where there can be many unanticipated relations of interest.
    198 
    199 This section focuses on TextRunner, the first model implementing \textsc{oie}.
    200 It uses an aggregate extraction setup where \(\dataSet\) is directly mapped to \(\kbSet\), with the peculiarity that \(\kbSet\) is defined using surface forms only.
    201 The hypothesis on which TextRunner relies is that the surface form of the relation conveyed by a sentence appears in the path between the two entities in its dependency tree.
    202 In the \textsc{oie} setup, these surface forms can then be used as labels for the conveyed relations, thereby using the language itself as the relation domain \(\relationSet\).
    203 TextRunner can be split into three parts:
    204 \begin{description}
    205 	\item[The Learner] is a naive Bayes classifier, trained on a small dataset to predict whether a fact \((e_1, r, e_2)\) is trustworthy.
    206 		To extract a set of samples for this task, a dependency parser (Figure~\ref{fig:relation extraction:dependency tree}) is run on the dataset and tuples \((e_1, r, e_2)\) are extracted where \(e_1\) and \(e_2\) are base noun phrases and \(r\) is the dependency path between the two entities.
    207 		The tuples are then automatically labeled as trustworthy or not according to a set of heuristics such as the length of the dependency path and whether it crosses a sentence boundary.
    208 		The naive Bayes classifier is then trained to predict the trustworthiness of a tuple given a set of hand-engineered features (Section~\ref{sec:relation extraction:hand-designed features}).
    209 	\item[The Extractor] extracts trustworthy facts on the whole dataset.
    210 		The features on which the Learner is built only depend on part-of-speech (\textsc{pos}) tags (noun, verb, adjective\dots) such that the Extractor does not need to run a dependency parser on all the sentences in the entire dataset.
    211 		\begin{marginparagraph}
    212 			Dependency parsers tend to be a lot slower than \textsc{pos} taggers.
    213 		\end{marginparagraph}
    214 		While the Learner uses the dependency path for \(r\), the Extractor uses the infix from which non-essential phrases (such as adverbs) are eliminated heuristically.
Thus the Extractor simply runs a \textsc{pos} tagger on all sentences, finds all possible entities \(e\), estimates a probable relation \(r\), and filters the resulting tuples using the Learner to output a set of trustworthy facts.
    216 	\item[The Assessor] assigns a probability that a fact is true from redundancy in the dataset using the urns model of \textcite{textrunner_assessor}.
    217 		This model uses a binomial distribution to model the probability that a correct fact appears \(k\) times among \(n\) extractions with a fixed repetition rate.
    218 		Furthermore, it assumes both correct and incorrect facts follow different Zipf's laws.
The shape parameter \(s_I\) of the distribution of incorrect facts is assumed to be 1, while the shape parameter \(s_C\) of the distribution of correct facts, as well as the number of correct facts \(N_C\), are estimated using an expectation--maximization algorithm.
    221 		\begin{marginparagraph}[-3mm]
    222 			Zipf's law comes from the externalist linguistic school.
It follows from the observation that the frequency of the second most common word is half that of the most frequent word, that the frequency of the third most common word is a third of that of the most frequent, etc.
    224 			The same distribution can often be observed in information extraction.
    225 			Zipf's law is parametrized by a shape \(s\) and the number of elements \(N\):
    226 			\begin{equation*}
    227 				P(x\mid s) \propto
    228 					\begin{cases}
    229 						x^{-s} & \text{ for } x\in\{1,\dotsc,N\} \\
    230 						0 & \text{ otherwise } \\
    231 					\end{cases}
    232 			\end{equation*}
    233 			A Zipf's law is easily recognizable on a \(\log\)--\(\log\) scale, its probability mass function being a straight line.
    234 			Take for example the Zipf's law with parameters \(s=2\) and \(N=10\):
    235 			\begin{center}
    236 				\input{mainmatter/relation extraction/zipf.tex}
    237 			\end{center}
    238 		\end{marginparagraph}
    239 		In the expectation step, the binomial and Zipf distribution assumptions can be combined using Bayes' theorem to estimate whether a fact is correct or not.
    240 		In the maximization step, the parameters \(s_C\) and \(N_C\) are estimated.
    241 \end{description}
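The truncated Zipf's law assumed by the Assessor is easy to sketch; the function below (illustrative, not taken from the TextRunner implementation) reproduces the probability mass function given in the margin note:

```python
def zipf_pmf(s, big_n):
    """P(x | s) ∝ x^{-s} for x in {1, ..., N}, 0 otherwise."""
    weights = [x ** -s for x in range(1, big_n + 1)]
    z = sum(weights)  # normalization constant
    return [w / z for w in weights]
```

With \(s = 1\), the second most frequent element indeed has half the probability of the most frequent one, the third a third of it, and so on.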
    242 
    243 \textcite{oie} compare their approach to KnowItAll, an earlier work similar to \textsc{oie} but needing a list of relations (surface forms) as input to define the target relation schema \(\relationSet\).
    244 On a set of ten relations, they manually labeled the extracted facts as correct or not, obtaining an error rate of 12\% for TextRunner and 18\% for KnowItAll.
    245 They further run their model on 9 million web pages, extracting 7.8 million facts.
    246 
    247 A limitation of the \textsc{oie} approach is that it heavily depends on the raw surface form and suffers from bad generalization.
    248 The two facts ``Bletchley Park \textsl{known as} Station X'' and ``Bletchley Park \textsl{codenamed} Station X'' are considered different by TextRunner since the surface forms conveying the relations in the underlying sentences are different.
    249 Subsequent \textsc{oie} approaches try to address this problem, such as \textcite{textrunner_synonym}, which extend TextRunner with a resolver \parencite{textrunner_resolver} to merge synonyms.
However, this problem has not yet been overcome and is still an active area of research.
Furthermore, since the input of \textsc{oie} systems is often taken to be the largest possible chunk of the web, and since the extracted facts do not follow a strict nomenclature, a fair comparison of \textsc{oie} systems with one another or with other unsupervised relation extraction models is still not feasible.
    252 
    253 \subsection{Clustering Surface Forms}
    254 \label{sec:relation extraction:hasegawa}
    255 The first unsupervised relation extraction model was the clustering approach of \textcitex{hasegawa}.
    256 It is somewhat similar to \textsc{dirt} (Section~\ref{sec:relation extraction:dirt}) in that it uses a similarity between samples.
    257 However, their work goes one step further by using this similarity to build relation classes.
    258 Furthermore, \textcite{hasegawa} does not assume \hypothesis{pullback}, i.e.~it does not assume that the sentence and entities convey the relation separately, on their own.
    259 Instead, its basic assumption is that the infix between two entities is the expression of the conveyed relation.
    260 \begin{marginparagraph}
    261 	As a reminder, the infix is the span of text between the two entities in the sentence.
    262 \end{marginparagraph}
    263 As such, if two infixes are similar, the sentences convey similar relations.
    264 Furthermore, \textsc{ner} (see the introduction of Chapter~\ref{chap:relation extraction}) is performed on the text instead of simple entity chunking.
    265 This means that all entities are tagged with a type such as ``organization'' and ``person.''
    266 These types strongly constrain the relations through the following assumption:
    267 \begin{marginparagraph}[-9mm]
    268 	Following Section~\ref{sec:context:relation algebra}, \(\breve{r}\) is the converse relation of \(r\), i.e.~the relation with \(e_1\) and \(e_2\) in the reverse order.
    269 	\(\relationComposition\) is the composition operator and \(\relationOne_X\) the complete relation over \(X\).
    270 	\(r\relationComposition\breve{r}\) is the relation linking all the entities which appear as subject (\(e_1\), on the left hand side) of \(r\) to themselves.
    271 	This relation is constrained to be between entities in \(X\).
    272 	Less relevant to this formula, \(r\relationComposition\breve{r}\) also links together entities linked by \(r\) to the same object.
    273 \end{marginparagraph}
    274 \begin{assumption}{type}
    275 	All entities have a unique type, and all relations are left and right restricted to one of these types.
    276 
    277 	\smallskip
    278 	\noindent
    279 	\(
    280 		\exists \symcal{T} \textup{ partition of } \entitySet :
    281 		\forall r\in\relationSet :
    282 		\exists X, Y\in \symcal{T} :
    283 		r\relationComposition \breve{r} \relationOr \relationOne_X = \relationOne_X
    284 		\; \land \;
    285 		\breve{r}\relationComposition r \relationOr \relationOne_Y = \relationOne_Y
    286 	\)
    287 \end{assumption}
    288 \begin{marginparagraph}[9mm]
    289 	Here, we assume that the partition \(\symcal{T}\) is not degenerate and somewhat looks like a standard \textsc{ner} classification output.
    290 	Otherwise, \(\symcal{T}=\{\entitySet\}\) is a valid partition of \(\entitySet\), and this assumption is tautological.
    291 \end{marginparagraph}
    292 This is a natural assumption for many relations; for example, the relation \textsl{born in} is always between a person and a geopolitical entity (\textsc{gpe}).
    293 
    294 Given a pair of entities \((e_1, e_2)\in\entitySet^2\), \textcite{hasegawa} collect all samples in which they appear and extract a single vector representation from all these samples.
    295 This representation is built from the bag of words of the infixes weighted by \textsc{tf--idf} (term frequency--inverse document frequency).
    296 Since a bag of words discards the ordering of the words or entities, the variant of \textsc{tf--idf} used takes into account the directionality:
    297 \begin{align*}
    298 	\textsc{tf}(w, e_1, e_2) = &
    299 			 \text{number of times \(w\) appears between \(e_1\) and \(e_2\)} \\
    300 			& \hspace{5mm} - \text{number of times \(w\) appears between \(e_2\) and \(e_1\)} \\
    301 	\textsc{idf}(w) = & (\text{number of documents in which \(w\) appears})^{-1}
    302 	\\
    303 	\textsc{tf--idf}(w, e_1, e_2) = & \textsc{tf}(w, e_1, e_2) \cdot \textsc{idf}(w)
    304 \end{align*}
    305 
    306 From this definition we obtain a representation \(\vctr{z}_{e_1, e_2}\in\symbb{R}^{V}\) of the pair \((e_1, e_2)\in\entitySet^2\) by taking the value of \(\textsc{tf--idf}(w, e_1, e_2)\) for all \(w\in V\).
Given two entity pairs, their similarity is defined as follows:
    308 \begin{equation*}
    309 	\operatorname{sim}(\vctr{e}, \vctr{e}')
    310 	= \cos(\vctr{z}_\vctr{e}, \vctr{z}_\vctr{e'})
    311 	= \frac{\vctr{z}_\vctr{e} \cdot \vctr{z}_\vctr{e'}}{\|\vctr{z}_\vctr{e}\| \|\vctr{z}_\vctr{e'}\|}.
    312 \end{equation*}
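A minimal sketch of this directional representation and similarity, assuming a precomputed document-frequency table (all names and the dictionary-based vector encoding are ours, purely illustrative):

```python
from collections import defaultdict
from math import sqrt

def pair_representation(occurrences, doc_freq):
    """Directional tf-idf vector z for one entity pair.

    occurrences: list of (infix_words, forward) pairs, forward being True when
    the sentence mentions e1 before e2.  doc_freq maps a word to the number of
    documents containing it.
    """
    tf = defaultdict(int)
    for words, forward in occurrences:
        for w in words:
            tf[w] += 1 if forward else -1  # signed term frequency
    return {w: count / doc_freq[w] for w, count in tf.items() if count}

def cosine(z1, z2):
    dot = sum(v * z2.get(w, 0.0) for w, v in z1.items())
    norms = (sqrt(sum(v * v for v in z1.values()))
             * sqrt(sum(v * v for v in z2.values())))
    return dot / norms
```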
    313 
    314 Using this similarity function, the complete-linkage clustering algorithm%
    315 \sidenote{
    316 	The complete-linkage algorithm is an agglomerative hierarchical clustering method also called farthest neighbor clustering.
	The algorithm starts with each sample in its own cluster, then merges clusters two by two until reaching the desired number of clusters.
	At each step, the two closest clusters are merged, with the distance between clusters being defined as the distance between their farthest elements.
    319 }
\parencite{complete_linkage} is used to extract relation classes.
Since each pair ends up in a single cluster, this assumes \hypothesis{1-adjacency}.
    322 \Textcite{hasegawa} evaluate their method on articles from the New York Times (\textsc{nyt}).
They extract relation classes by first clustering all \(\vctr{z}_{e_1, e_2}\) where \(e_1\) has the type person and \(e_2\) has the type \textsc{gpe}, and then by clustering all \(\vctr{z}_{e_1, e_2}\) where both \(e_1\) and \(e_2\) are organizations.
    324 By clustering separately different type combinations, they ensure that \hypothesis{type} is enforced.
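The complete-linkage procedure described in the sidenote can be sketched as follows; this naive quadratic version is for illustration only and is not how production implementations proceed:

```python
def complete_linkage(points, distance, n_clusters):
    # Farthest-neighbor agglomerative clustering: start with singletons and
    # repeatedly merge the two clusters whose farthest elements are closest.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest elements.
                d = max(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```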
    325 
    326 They furthermore experiment with automatic labeling of the clusters with the most frequent word appearing in the samples.
    327 Apart from the relation \textsl{prime minister}, which is simply labeled ``minister'' since only unigrams are considered, the labels are rather on point.
    328 To measure the performance of their model, they use a classical supervised \fone{} where each cluster is labeled by the majority gold relation.
Using this somewhat ill-suited metric, they reach an \fone{} of 82\% on person--\textsc{gpe} pairs and an \fone{} of 77\% on organization--organization pairs.
    330 This relatively high score compared to subsequent models can be explained by the small size of their dataset, which is further split by entity type.
    331 Furthermore, note that some generic relations such as \textsl{part of} do not follow \hypothesis{type} and, as such, cannot be captured.
    332 
    333 \subsection{Rel-\textsc{lda}}
    334 \label{sec:relation extraction:rellda}
    335 Rel-\textsc{lda} \parencitex{rellda} is a probabilistic generative model inspired by \textsc{lda}.
    336 It works by clustering sentences: each relation defines a distribution over a handcrafted set of sentence features (Section~\ref{sec:relation extraction:hand-designed features}) describing the relationship between the two entities in the text.
    337 Furthermore, rel-\textsc{lda} models the propensity of a relation at the level of the document; thus, it is not strictly speaking a sentence-level relation extractor.
    338 The idea behind modeling this additional information is that when a relation such as \wdrel{413} \textsl{position played on team} appears in a document, other relations pertaining to sports are more likely to appear.
    339 Figure~\ref{fig:relation extraction:rellda plate} gives the plate diagram for the rel-\textsc{lda} model.
    340 It uses the following variables:
    341 \begin{itemize}[nosep,label={}]
    342 	\item \(\rndmvctr{f}_i\) the features of the \(i\)-th sample, where \(\rndm{f}_{ij}\) is its \(j\)-th feature
    343 	\item \(\rndm{r}_i\) the relation of the \(i\)-th sample
    344 	\item \(\rndm{\theta}_d\) the distribution of relations in the document \(d\)
	\item \(\rndm{\phi}_{rj}\) the probability of the \(j\)-th feature occurring for the relation \(r\)
    346 	\item \(\alpha\) the Dirichlet prior for \(\rndm{\theta}_d\)
    347 	\item \(\beta\) the Dirichlet prior for \(\rndm{\phi}_{rj}\)
    348 \end{itemize}
    349 The generative process is listed as Algorithm~\ref{alg:relation extraction:rellda generation}.
    350 The learning process uses the expectation--maximization algorithm.
    351 In the variational E-step, the relation for each sample \(r_i\) is sampled from the categorical distribution:
    352 \begin{equation*}
    353 	P(r_i\mid \vctr{f}_i, d) \propto P(r_i\mid d) \prod_{j=1}^m P(f_{ij}\mid r_i)
    354 \end{equation*}
    355 \begin{marginfigure}
    356 	\centering
    357 	\input{mainmatter/relation extraction/rellda plate.tex}
    358 	\scaption[Rel-\textsc{lda} plate diagram.]{
    359 		Rel-\textsc{lda} plate diagram.
    360 		\(D\) is the number of documents in the dataset and \(n_d\) is the number of samples in the document \(d\).
		For each sample \(i\), there are several features \(\rndm{f}_{i1}, \rndm{f}_{i2}, \dotsc, \rndm{f}_{im}\); accordingly, for each relation \(r\), there are also several feature priors \(\rndm{\phi}_{r1}, \dotsc, \rndm{\phi}_{rm}\). However, for simplicity, a single prior is shown here.
    362 		\label{fig:relation extraction:rellda plate}
    363 	}
    364 \end{marginfigure}%
    365 \begin{marginalgorithm}
    366 	\centering
    367 	\input{mainmatter/relation extraction/rellda.tex}
    368 	\scaption[The rel-\textsc{lda} generative process.]{
    369 		The rel-\textsc{lda} generative process.
    370 		\(\operatorname{Dir}\) are Dirichlet distributions.
    371 		\(\operatorname{Cat}\) are categorical distributions.
    372 		\label{alg:relation extraction:rellda generation}
    373 	}
    374 \end{marginalgorithm}%
    375 where \(P(r\mid d)\) is defined by \(\theta_d\) and \(P(f_{ij}\mid r)\) is defined by \(\phi_{rj}\).
In the M-step, the values for \(\theta_d\) are computed from the number of times each relation appears in \(d\) and from the hyperprior \(\alpha\); and the value for \(\phi_{rj}\) is computed from the number of co-occurrences of the \(j\)-th feature with the relation \(r\) and from \(\beta\).
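The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes, not the authors' implementation: the names \texttt{e\_step} and \texttt{m\_step} are hypothetical, \texttt{phi} is stored as an \(R \times m \times V\) array over a vocabulary of \(V\) possible feature values, and \(\alpha\), \(\beta\) are treated as scalar symmetric Dirichlet priors.

```python
import numpy as np

def e_step(theta_d, phi, f):
    """Posterior over relations for one sample (variational E-step).

    theta_d: (R,) relation distribution of the sample's document.
    phi:     (R, J, V) per-relation categorical over each feature slot.
    f:       (J,) observed feature indices of the sample.
    """
    J = len(f)
    # P(r | d) * prod_j P(f_j | r), then renormalize over relations.
    post = theta_d * np.prod([phi[:, j, f[j]] for j in range(J)], axis=0)
    return post / post.sum()

def m_step(posteriors, features, alpha, beta, R, V):
    """Smoothed count updates for theta and phi (M-step).

    posteriors: (N, R) E-step responsibilities for a document's samples.
    features:   (N, J) observed feature indices.
    """
    N, J = features.shape
    theta = posteriors.sum(axis=0) + alpha  # relation counts plus prior
    theta /= theta.sum()
    phi = np.full((R, J, V), beta)          # co-occurrence counts plus prior
    for i in range(N):
        for j in range(J):
            phi[:, j, features[i, j]] += posteriors[i]
    phi /= phi.sum(axis=2, keepdims=True)
    return theta, phi
```

With uniform feature distributions, the E-step posterior simply reproduces the document prior \(\theta_d\), which makes the role of the document-level variable apparent.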
    377 
    378 \Textcite{rellda} evaluate their model on the New York Times by comparing their clusters to relations in Freebase.
    379 However, because of the incompleteness of knowledge bases, they only evaluate the recall on Freebase and use manual annotation to estimate the precision.
    380 Even though the original article lacks a significant comparison, subsequent approaches often compare to rel-\textsc{lda}.
    381 
A first limitation of their approach is that given the relation \(r\), the features \(f\) are independent.
    383 Since the entities are among those features, this means that \(P(e_2\mid e_1, r) = P(e_2\mid r)\) which is clearly false.
    384 \begin{assumption}{biclique}
    385 	Given a relation, the entities are independent of one another: \( \displaystyle \rndm{e}_1 \independent \rndm{e}_2 \mid \rndm{r} \).
    386 	In other words, given a relation, all possible head entities are connected to all possible tail entities.
    387 	
    388 	\smallskip
    389 	\noindent
    390 	\(\forall r\in\relationSet:\exists A,B\subseteq\entitySet: r\relationComposition\breve{r}=\relationOne_A\land\breve{r}\relationComposition r=\relationOne_B\)
    391 \end{assumption}
    392 This is a widespread problem with generative models which are inclined to make extensive independence assumptions.
Furthermore, generative models have an implicit bias that all observed features are related to relation extraction, even though they might measure other aspects of the sample (style, idiolectal word choice, etc.).
This might result in the model focusing on features not related to the relation extraction task.
    395 
    396 Several extensions of rel-\textsc{lda} were proposed.
Type-\textsc{lda} \parencite{rellda} proposes to model entity types, which are latent variables generating entity features and are themselves generated from the relation variable \(r\), thus softly enforcing \hypothesis{type}.
Sense-\textsc{lda} \parencitex{rellda_sense} uses an \textsc{lda}-like model for each different dependency path.
    399 Clusters for different paths are then merged into relation clusters.
    400 
    401 Rel-\textsc{lda} is an important work in that it proposes a simple evaluation framework; in particular, it introduces the \bcubed{} metric to unsupervised relation extraction.
    402 However, it predates the advent of neural networks and distributed representations in relation extraction, by which it was bound to be replaced.
    403 
    404 \subsection{Variational Autoencoder for Relation Extraction}
    405 \label{sec:relation extraction:vae}
    406 \Textcitex{vae_re} were first to propose a discriminative unsupervised relation extraction model.
    407 Discriminative models directly solve the inference problem of finding the posterior \(P(r\mid x)\).
    408 This is in contrast to generative models such as rel-\textsc{lda} which determine \(P(x\mid r)\) and then use Bayes' theorem to compute \(P(r\mid x)\) and make a prediction.
    409 The model of \textcite{vae_re} is closely related to the approach presented in Chapter~\ref{chap:fitb}.
    410 It is a clustering model, meaning that it produces clusters of samples where the samples in each cluster all convey the same relation.
    411 To do so, it uses a variational autoencoder model (\textsc{vae}, \citex{vae}) that we now describe.
    412 
    413 \paragraph{Variational Autoencoder}
    414 \begin{marginfigure}
    415 	\centering
    416 	\input{mainmatter/relation extraction/vae plate.tex}
    417 	\scaption[\textsc{vae} plate diagram.]{
    418 		\textsc{vae} plate diagram.
    419 		\(N\) is the number of samples in the dataset.
    420 		\label{fig:relation extraction:vae plate}
    421 	}
    422 \end{marginfigure}
    423 The goal of a variational autoencoder is to learn a latent variable \(\vctr{z}\) which explains the distribution of an observed variable \(\vctr{x}\).
    424 For our problem, the latent variable corresponds to the relation conveyed by the sample \(\vctr{x}\).
    425 We assume we know the generative process \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})\), i.e.~this process is the ``decoder'' (parametrized by \(\vctr{\theta}\)): given the latent variable it produces a sample.
    426 However, the process of interest to us is to estimate the latent variable---the relation---from a sample, that is \(P(\vctr{z}\mid\vctr{x}; \vctr{\theta})\).
    427 Using Bayes' theorem we can reformulate this posterior as \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})P(\vctr{z}\mid \vctr{\theta}) \divslash P(\vctr{x}\mid \vctr{\theta})\).
    428 However, computing \(P(\vctr{x}\mid \vctr{\theta})\) is often intractable, especially when the likelihood \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})\) is modeled using a complicated function like a neural network.
    429 To solve this problem, a variational approach is used: another model \(Q\) parametrized by \(\vctr{\phi}\) is used to approximate \(P(\vctr{z}\mid\vctr{x}; \vctr{\theta})\) as well as possible.
    430 This approximation \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) is the ``encoder'' since it finds the latent variable associated with a sample.
    431 The model can then be trained by maximizing the log-likelihood given the latent variable estimated by \(Q\) and by minimizing the difference between the latent variable predicted by \(Q\) and the desired prior \(P(\vctr{z}\mid\vctr{\theta})\):
    432 \begin{equation}
    433 	J_\textsc{elbo}(\vctr{\theta}, \vctr{\phi}) = \expectation_{Q(\vctr{z}\mid \vctr{x}; \vctr{\phi})}[\log P(\vctr{x}\mid\vctr{z};\vctr{\theta})] - \kl(Q(\vctr{z}\mid \vctr{x}; \vctr{\phi}) \mathrel{\|} P(\vctr{z}\mid\vctr{\theta}))
    434 	\label{eq:relation extraction:elbo}
    435 \end{equation}
A justification for this objective can also be found in the fact that it is a lower bound of the log marginal likelihood \(\log P(\vctr{x}\mid \vctr{\theta})\), hence its name: evidence lower bound (\textsc{elbo}).
    437 The first part of the objective is often referred to as the negative reconstruction loss since it seeks to reconstruct the sample \(\vctr{x}\) after it went through the encoder \(Q\) and the decoder \(P\).
One last problem with the \textsc{vae} approximation relates to the reconstruction loss: the estimation of the expectation over \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) is not differentiable, which makes the model---in particular \(\vctr{\phi}\)---untrainable by gradient descent.
This is usually solved using the reparameterization trick: sampling from \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) can often be done in two steps: first sampling from a simple distribution such as \(\epsilon\sim\normalDistribution(0, 1)\), then transforming this sample using a deterministic process parametrized by \(\vctr{\phi}\).
    440 The plate diagram of the \textsc{vae} is given Figure~\ref{fig:relation extraction:vae plate} where the model \(P\) is marked with solid lines and the variational approximation \(Q\) is marked with dashed lines.
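For the common case of a Gaussian approximate posterior with a standard normal prior, the reparameterization trick and the Kullback--Leibler term of the \textsc{elbo} both have short closed forms. The following NumPy sketch illustrates them; the function names are hypothetical, and a real \textsc{vae} would compute \texttt{mu} and \texttt{log\_var} with the encoder network and differentiate through these expressions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma)
    applied to parameter-free noise, so gradients can flow to mu and log_var."""
    eps = rng.standard_normal(mu.shape)      # eps ~ N(0, 1), no parameters
    return mu + np.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

def gaussian_kl(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), the prior term of the ELBO."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

When the approximate posterior matches the prior (\texttt{mu = 0}, \texttt{log\_var = 0}), the divergence term vanishes, and only the reconstruction term drives the gradient.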
    441 
    442 \begin{marginfigure}
    443 	\centering
    444 	\input{mainmatter/relation extraction/marcheggiani plate.tex}
    445 	\scaption[\textcite{vae_re} plate diagram.]{
    446 		\textcite{vae_re} plate diagram.
    447 		\label{fig:relation extraction:marcheggiani plate}
    448 	}
    449 \end{marginfigure}
    450 
    451 \bigskip
    452 
    453 Coming back to the model of \textcite{vae_re}, it is a conditional \(\beta\)\textsc{-vae},%
    454 \sidenote{
    455 	The \(\beta\) in ``\(\beta\)\textsc{-vae}'' simply indicates that the Kullback--Leibler term in Equation~\ref{eq:relation extraction:elbo} is weighted by a hyperparameter \(\beta\).
    456 	More details are given in Chapter~\ref{chap:fitb}.
    457 }
    458 i.e.~the whole process is conditioned on an additional variable.
    459 Indeed, in their approach, only the entities \(\vctr{e}\in\entitySet^2\) are reconstructed, while the sentence \(s\in\sentenceSet\) simply conditions the whole process.
    460 The latent variable explaining the observed entities is expected to be the relation conveyed by the sample.
    461 The resulting model's plate diagram is given in Figure~\ref{fig:relation extraction:marcheggiani plate}.
    462 This approach is defined by two models:
    463 \begin{description}
    464 	\item[The Encoder] \(Q(\rndm{r}\mid\vctr{e}, s; \vctr{\phi})\) is the relation extraction model properly speaking.
    465 		It is defined as a linear model on top of handcrafted features (Section~\ref{sec:relation extraction:hand-designed features}).
    466 		For each sample, the model outputs a distribution over a predefined number of relations.
    467 	\item[The Decoder] \(P(\vctr{e}\mid r; \vctr{\theta})\) is a model estimating how likely it is for two entities to be linked by a relation.
    468 		It is a reconstruction model since the entities \(\vctr{e}\) are known and need to be retrieved from the latent relation \(r\) sampled from the encoder.
    469 		It is defined using selectional preferences (Section~\ref{sec:context:selectional preferences}) and \textsc{rescal} (Section~\ref{sec:context:rescal}).
    470 \end{description}
    471 Note that to label a sample \((\vctr{e}, s)\in\dataSet\), \textcite{vae_re} simply select \(\argmax_{r\in\relationSet} Q(r\mid \vctr{e}, s; \vctr{\phi})\), meaning that the decoder is not used during evaluation.
    472 Its sole purpose is to provide a supervision signal to the encoder through the maximization of \(J_\textsc{elbo}\).
    473 The whole autoencoder can also be interpreted as being trained by a surrogate task of filling-in entity blanks.
    474 This is the interpretation we use in Chapter~\ref{chap:fitb}.
    475 
    476 For Equation~\ref{eq:relation extraction:elbo} to be well defined, a prior on the relations must also be selected; \textcite{vae_re} make the following assumption:
    477 \begin{assumption}{uniform}
    478 	All relations occur with equal frequency.
    479 	
    480 	\smallskip
    481 	\noindent
    482 	\( \displaystyle \forall r\in\relationSet\colon P(r) = \frac{1}{|\relationSet|} \)
    483 \end{assumption}
    484 
    485 They evaluate their approach on the New York Times distantly supervised by Freebase.
    486 By inducing 100 clusters, they show an improvement of the \bcubed{} \fone{} compared to \textsc{dirt} (Section~\ref{sec:relation extraction:dirt}) and rel-\textsc{lda} (Section~\ref{sec:relation extraction:rellda}).
    487 They also experiment using semi-supervised evaluation (Section~\ref{sec:relation extraction:unsupervised evaluation}) by pre-training their decoder on a subset of Freebase before training their encoder as described above; this additional supervision improves the \fone{} by more than 27\%.
    488 These results were further improved by \textcite{vae_re2}, which proposed to split the latent variable into a relation \(r\) and sentence information \(z\), with \(z\) conditioned on \(r\) and using a loss including the reconstruction of the sentence \(s\) from \(z\).
    489 
    490 \subsection{Matching the Blanks}
    491 \label{sec:relation extraction:mtb}
    492 Matching the blanks (\textsc{mtb}, \citex{mtb}) is an unsupervised method that does not attempt to cluster samples but rather learns a representation of the relational semantics they convey.
    493 More precisely, this representation is used to measure the similarity between samples such that similar samples convey similar relations.
    494 As such, it is either evaluated as a supervised pre-training method (Section~\ref{sec:relation extraction:unsupervised evaluation}) or using a few-shot dataset (Section~\ref{sec:relation extraction:few-shot}).
    495 The \textsc{mtb} article introduces several methods to extract an entity-aware representation of a sentence using \textsc{bert}; this was discussed in Section~\ref{sec:relation extraction:mtb sentential}.
    496 This section focuses on the unsupervised training.
As a reminder, we refer to the sentence encoder of \textsc{mtb} by the function \(\bertcoder\colon\sentenceSet\to\symbb{R}^d\) illustrated Figure~\ref{fig:relation extraction:emes}.
    498 Given this encoder, \textsc{mtb} defines the similarity between samples as:
    499 \begin{equation}
    500 	\operatorname{sim}(s, s') = \sigmoid(\bertcoder(s)\transpose\bertcoder(s'))
    501 	\label{eq:relation extraction:mtb similarity}
    502 \end{equation}
    503 This similarity function can be used to evaluate the model on a few-shot task.
Note that this function completely ignores entity identifiers (e.g.\ \wdent{211539}), but can still exploit the entities' surface forms (e.g.\ ``Peter Singer'') through the sentence \(s\in\sentenceSet\).
This model can be used as is, without any training other than the masked language model pre-training of \textsc{bert} (Section~\ref{sec:context:mlm}), and reaches an accuracy of 72.9\% on the FewRel 5 way 1 shot dataset.
    506 
    507 \Textcite{mtb} propose a training objective to fine-tune \textsc{bert} for the unsupervised relation extraction task.
    508 This objective is called matching the blanks.
    509 It assumes that two sentences containing the same entities convey the same relation.
    510 This is exactly \hypothesis{1-adjacency} as given Section~\refAssumptionSection{oneadjacency}.
    511 The probability that two sentences convey the same relation (\(\rndm{D}=1\)) is taken from the similarity function: \(P(\rndm{D}=1\mid s, s')=\operatorname{sim}(s, s')\).
    512 Given this, the \hypothesis{1-adjacency} assumption is translated into the following negative sampling (Section~\ref{sec:context:negative sampling}) loss:
    513 \begin{equation}
    514 	\loss{mtb} = \frac{-1}{|\dataSet|^2} \sum_{\substack{(\vctr{e}, s)\in\dataSet\\(\vctr{e}', s')\in\dataSet}}
    515 	\begin{array}[t]{l}
    516 		\delta_{\vctr{e},\vctr{e}'} \log P(\rndm{D}=1\mid s, s') \\
    517 		\hspace{1cm} + (1 - \delta_{\vctr{e},\vctr{e}'}) \log P(\rndm{D}=0\mid s, s') \\
    518 	\end{array}
    519 	\label{eq:relation extraction:mtb loss}
    520 \end{equation}
    521 This loss is minimized through gradient descent by sampling random positive and negative sentence pairs.
    522 These pairs can be obtained by comparing the entity identifier without the need for any supervision.
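The per-pair term of \loss{mtb} reduces to a binary cross-entropy on the inner product of the two encoded sentences. The following sketch illustrates it on plain vectors, under the assumption that \texttt{z} and \texttt{z\_prime} stand for the outputs \(\bertcoder(s)\) and \(\bertcoder(s')\); the function name \texttt{mtb\_pair\_loss} is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mtb_pair_loss(z, z_prime, same_entities):
    """Negative sampling loss for one sentence pair.

    z, z_prime:    encoder outputs for the two sentences.
    same_entities: True if both samples share both entities (positive pair).
    """
    p_same = sigmoid(z @ z_prime)  # P(D = 1 | s, s') = sim(s, s')
    if same_entities:
        return -np.log(p_same)     # pull positive pairs together
    return -np.log(1.0 - p_same)   # push negative pairs apart
```

Two similar representations thus incur a small loss as a positive pair and a large loss as a negative pair, which is what drives the encoder towards relation-discriminative features.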
    523 
A problem with this approach is that the \bertcoder{} model can simply learn to perform entity linking on the entities' surface forms in the sentences \(s\), thus minimizing Equation~\ref{eq:relation extraction:mtb loss} by predicting whether \(\vctr{e}=\vctr{e}'\).
    525 We want to avoid this since this would only work on samples seen during training and would not generalize to unseen entities.
    526 To ensure the model predicts whether the samples convey the same relation from the sentences \(s\) and \(s'\) alone, blanks are introduced.
A special token \blanktag{} is substituted for the entities as follows:
    528 \begin{indentedexample}
    529 	\uhead{\blanktag}, inspired by Cale's earlier cover, recorded one of the most acclaimed versions of ``\utail{\blanktag}.''\\
    530 	\smallskip
    531 	\uhead{\blanktag}'s rendition of ``\utail{\blanktag}'' has been called ``one of the great songs'' by Time\dots
    532 \end{indentedexample}
This is similar to the sample corruption of \textsc{bert} (Section~\ref{sec:context:mlm}); indeed, as in \textsc{bert}, the entity surface forms are blanked only a fraction%
\sidenote{\Textcite{mtb} blanks each entity with a probability of 70\%, meaning that only 9\% of training samples have both of their entity surface forms intact.}
of the time so as not to confuse the model when real entities appear during evaluation.
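The blanking procedure can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the names \texttt{blank\_entities}, \texttt{head\_span}, and \texttt{tail\_span} are hypothetical. Each entity is independently replaced with probability \(p = 0.7\), so both surface forms stay intact only \((1-p)^2 = 9\%\) of the time.

```python
import random

BLANK = "[BLANK]"

def blank_entities(tokens, head_span, tail_span, p=0.7, rng=random):
    """Replace each entity surface form by a single [BLANK] token with
    probability p, independently for the head and the tail."""
    out = list(tokens)
    # Process spans right-to-left so earlier indices stay valid
    # after a multi-token span collapses into one [BLANK] token.
    for start, end in sorted([head_span, tail_span], reverse=True):
        if rng.random() < p:
            out[start:end] = [BLANK]
    return out
```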
    536 
    537 Another problem with Equation~\ref{eq:relation extraction:mtb loss} is that the negative sample space \(\vctr{e}\neq\vctr{e}'\) is extremely large.
    538 Instead of taking negative samples randomly in this space, \textcite{mtb} propose to take only samples which are likely to be close to positive ones.
    539 To this end, the \(\vctr{e}\neq\vctr{e}'\) condition is actually replaced with the following one:
    540 \begin{equation*}
    541 	|\{e_1, e_2\} \cap \{e_1', e_2'\}| = 1
    542 \end{equation*}
    543 These are called ``strong negatives'': negative samples that have precisely one entity in common.
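The strong-negative condition is a one-line set test; a minimal sketch (the function name is hypothetical, and entities are represented by their identifiers):

```python
def is_strong_negative(e, e_prime):
    """True when the two entity pairs share exactly one entity,
    i.e. |{e1, e2} & {e1', e2'}| = 1."""
    return len(set(e) & set(e_prime)) == 1
```

Pairs sharing both entities are positives, pairs sharing none are ordinary negatives, and only the intermediate case qualifies as a strong negative.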
    544 Negative sampling, especially with strong negatives, leads to another unfortunate assumption:
    545 \begin{assumption}[onetoone]{\(1\to1\)}
    546 	All relations are one-to-one.
    547 
    548 	\smallskip
    549 	\noindent
    550 	\( \forall r\in\relationSet\colon
    551 	r\relationComposition \breve{r} \relationOr \relationIdentity
    552 	= \breve{r}\relationComposition r \relationOr \relationIdentity
    553 	= \relationIdentity \)
    554 \end{assumption}
Indeed, if a relation is not one-to-one, then there exist two facts \tripletHolds{e_1}{r}{e_2} and \tripletHolds{e_1}{r}{e_3} (or respectively with \(\breve{r}\)); however, these two facts form a strong negative pair, and therefore, as per \loss{mtb}, their representations must be pulled away from one another.
    556 
Despite these assumptions, \textsc{mtb} showcases impressive results, both as a few-shot and supervised pre-training method.
    558 It obtained state-of-the-art results both on the SemEval 2010 Task 8 dataset with a macro-\(\overHalfdirected{\fone}\) %
    559 \begin{marginparagraph}[-6mm]
    560 	As a reminder, \(\overHalfdirected{\fone}\) is the half-directed metric described Section~\ref{sec:relation extraction:supervised evaluation}.
    561 	It is referred to as ``taking directionality into account'' in the SemEval dataset.
    562 \end{marginparagraph}
    563 of 82.7\% and on FewRel with an accuracy of 90.1\% on the 5 way 1 shot task.
    564 
    565 \subsection{Self\textsc{ore}}
    566 \label{sec:relation extraction:selfore}
    567 Self\textsc{ore} \parencitex{selfore} is a clustering approach similar to the one of \textcite{hasegawa} presented in Section~\ref{sec:relation extraction:hasegawa} but using deep neural network models for extracting sentence representations and for grouping these representations into relation clusters.
    568 Since they follow the experimental setup of \textcite{fitb}, which we present in Chapter~\ref{chap:fitb}, their results are listed in that chapter.
    569 
    570 Self\textsc{ore} uses \textsc{mtb}'s entity markers--entity start \bertcoder{} sentence representation.
A clustering algorithm could be run to produce relation classes from these representations à la \textcite{hasegawa}.
    572 However, \textcite{selfore} introduce an iterative scheme to purify the clusters.
    573 This scheme is illustrated in Figure~\ref{fig:relation extraction:selfore} and works by alternatively optimizing two losses \loss{ac} and \loss{rc}.
    574 
    575 The first loss \loss{ac} is the clustering loss which comes from \textsc{dec} \parencitex{dec}.
    576 \textsc{dec} is a deep clustering algorithm that uses a denoising autoencoder \parencite{dae} to compress the input.
    577 In their case, the input \(\vctr{h}\) is the sentence encoded by \bertcoder{}.
    578 The denoising autoencoder is trained layer by layer with a small bottleneck which produces a compressed representation of the sentence \(\vctr{z}=\operatorname{Encoder}(\vctr{h})\).
    579 This is the space in which the clustering occurs.
    580 For each cluster \(j=1,\dotsc,K\), a centroid%
    581 \sidenote{
    582 	The \(k\)-means clustering algorithm is used to initialize the centroids.
    583 	In practice, the \(k\)-means clusters could directly be used as soft labels.
    584 	However, \textcite{selfore} show that this underperforms compared to refining the clusters with \loss{ac}.
    585 }
    586 \(\vctr{\mu}_j\) is learned such that a sentence is part of the cluster whose centroid is the closest to its compressed representation.
    587 This is modeled with a Student's \(t\)-distribution with one degree of freedom centered around the centroid:
    588 \begin{marginfigure}
    589 	\centering
    590 	\input{mainmatter/relation extraction/selfore.tex}
    591 	\scaption[Self\textsc{ore} iterative algorithm.]{
    592 		Self\textsc{ore} iterative algorithm.
    593 		\label{fig:relation extraction:selfore}
    594 	}
    595 \end{marginfigure}
    596 \begin{equation*}
    597 	q_{ij} = \frac{(1+\|\vctr{z}_i-\vctr{\mu}_j\|^2)^{-1}}{\sum_k (1+\|\vctr{z}_i-\vctr{\mu}_k\|^2)^{-1}}
    598 \end{equation*}
    599 To force the initial clusters to be more distinct, a target distribution \(p\) is defined as:
    600 \begin{equation}
    601 	p_{ij} = \frac{q_{ij}^2 \divslash f_j}{\sum_k q_{ik}^2 \divslash f_k}
    602 	\label{eq:relation extraction:selfore target}
    603 \end{equation}
    604 where \(f_j=\sum_i q_{ij}\) are soft cluster frequencies.
    605 To push \(\mtrx{Q}\) towards \(\mtrx{P}\), a Kullback--Leibler divergence is used:
    606 \begin{equation*}
    607 	\loss{ac} = \kl(\mtrx{P}\mathrel{\|}\mtrx{Q}) = \sum_{i=1}^{|\dataSet|} \sum_{j=1}^K p_{ij} \log \frac{p_{ij}}{q_{ij}}
    608 \end{equation*}
    609 This loss is minimized by backpropagating to the cluster centroids \(\vctr{\mu}_j\) and to the encoder's parameters in the \textsc{dae}.
    610 Note that the decoder of the \textsc{dae} is only used for initializing the encoder such that the input can be reconstructed.
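The soft assignment \(q_{ij}\), the target distribution \(p_{ij}\), and \loss{ac} translate directly into NumPy; the following is a minimal sketch under assumed names (\texttt{soft\_assign}, \texttt{z} for the compressed representations, \texttt{mu} for the centroids), leaving out the autoencoder and the backpropagation machinery.

```python
import numpy as np

def soft_assign(z, mu):
    """Student's t (one degree of freedom) soft assignment q_ij
    of compressed points z (N, d) to centroids mu (K, d)."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Target p_ij: squaring concentrates the assignments, dividing by
    the soft cluster frequencies f_j pushes them towards uniform."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    """KL(P || Q) summed over samples and clusters."""
    return float(np.sum(p * np.log(p / q)))
```

Since \(\mtrx{P}\) is recomputed from \(\mtrx{Q}\) before each optimization round, minimizing this divergence is a form of self-training: the model is pushed towards its own sharpened predictions.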
    611 
    612 Optimizing \loss{ac} is the first step of Self\textsc{ore}; it assigns a pseudo-label to each sample in the dataset.
    613 The second step is to train a classifier to predict these pseudo-labels.
    614 The classifier is a simple multi-layer perceptron trained with the usual cross-entropy classification loss, which is called \loss{rc} in Self\textsc{ore}.
This loss also backpropagates to the \bertcoder{}, thus changing the sentence representations \(\vctr{h}\).
    616 Self\textsc{ore} is an iterative algorithm: changing the \(\vctr{h}\) modifies the clustering found by \textsc{dec}.
    617 Thus, the two steps, clustering and classification, are repeated several times until a stable label assignment is found.
    618 
    619 The central assumption of Self\textsc{ore} is that \bertcoder{} already produces a good representation for relation extraction, which, as we saw with the non-fine-tuned \bertcoder{} score on FewRel in Section~\ref{sec:relation extraction:mtb}, is rather accurate.
    620 However, Self\textsc{ore} also assumes \hypothesis{uniform}, i.e.~that all relations appear with the same frequency.
    621 This assumption is enforced by \loss{ac}, through the normalization of the target distribution \(\mtrx{P}\) by soft cluster frequencies \(f_j\).%
    622 \sidenote{For further details, \textcite{dec} contains an analysis of the \textsc{dec} clustering algorithm on imbalanced \textsc{mnist} data.}
Indeed, the distribution \(\mtrx{P}\) is the original distribution \(\mtrx{Q}\) made more concentrated (because of the square) and more uniform (because of the normalization by \(f_j\)).
    624 
    625 The interpretation of the concentration effect in terms of modeling hypotheses is more complex.
    626 The variable \(\vctr{h}\) is the concatenation of the two entity embeddings.
    627 Let's break down the \bertcoder{} function into two components: \(\operatorname{ctx}_1(s)\) and \(\operatorname{ctx}_2(s)\).
These are simply the two contextualized embeddings of \texttt{<e1>} and \texttt{<e2>} (Section~\ref{sec:relation extraction:mtb}); in other words, the function \(\operatorname{ctx}\) contextualizes an entity surface form inside its sentence.
When two sentence representations \(\vctr{h}\) and \(\vctr{h}'\) are close, their pseudo-labels tend to be the same, and thus their relations also tend to be the same.
    630 In other words:
    631 \begin{assumption}[ctxoneadjacency]{\ctxoneadj}
    632 	Two samples with the same contextualized representation of their entities' surface forms convey the same relation.
    633 
    634 	\smallskip
    635 	\noindent
    636 	\( \forall (s, \vctr{e}, r), (s', \vctr{e}', r')\in\dataSet_\relationSet\colon \)\\
    637 	\null\hfill\( \operatorname{ctx}_1(s)=\operatorname{ctx}_1(s') \land \operatorname{ctx}_2(s)=\operatorname{ctx}_2(s') \implies r=r' \)
    638 \end{assumption}
If we assume \bertcoder{} only performs entity linking of the entities' surface forms, then \(\operatorname{ctx}_i(s)=e_i\) for \(i=1,2\); in this case \hypothesis{\ctxoneadj} collapses to \hypothesis{1-adjacency}, and the contextualization inside the sentence \(s\) is ignored.
On the other hand, if we assume \bertcoder{} provides no information about the entities and only encodes the sentence, then \(\operatorname{ctx}_i(s)=s\) for \(i=1,2\), and \hypothesis{\ctxoneadj} only states that the entity identifiers \(\vctr{e}\in\entitySet^2\) should have no influence on the relation.
The effective repercussion of \hypothesis{\ctxoneadj} lies somewhere halfway between these two extremes.